• DOMAIN: Telecom
• CONTEXT: A telecom company wants to use its historical customer data to predict behaviour and retain customers. You can analyse all relevant customer data and develop focused customer-retention programs.
• DATA DESCRIPTION: Each row represents a customer; each column contains a customer attribute described in the column metadata. The data set includes information about:
  • Customers who left within the last month – the column is called Churn
  • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers – gender, age range, and whether they have partners and dependents
• PROJECT OBJECTIVE: Build a model that identifies the customers with a higher probability of churning. This helps the company understand the pain points and patterns of customer churn and sharpens its focus on customer-retention strategy.
• Steps to the project: [ Total score: 60 points ]

  1. Import and warehouse data: [ Score: 5 points ] • Import all the given datasets. Explore shape and size. • Merge all datasets into one and explore the final shape and size.
  2. Data cleansing: [ Score: 10 points ] • Missing-value treatment • Convert categorical attributes to numeric using relevant functional knowledge • Drop attributes if required, using relevant functional knowledge • Automate all the above steps
  3. Data analysis & visualisation: [ Score: 10 points ] • Perform detailed statistical analysis on the data. • Perform a detailed univariate, bivariate and multivariate analysis, with appropriate detailed comments after each analysis.
  4. Data pre-processing: [ Score: 5 points ] • Segregate predictors vs target attributes. • Check for target balancing and fix it if imbalanced. • Perform a train-test split. • Check whether the train and test data have statistical characteristics similar to the original data.
  5. Model training, testing and tuning: [ Score: 25 points ] • Train and test all ensemble models taught in the learning module. • Suggestion: Use standard ensembles available. You can also design your own ensemble technique using weak classifiers. • Display the classification accuracies for train and test data. • Apply all possible tuning techniques to train the best model for the given data. • Suggestion: Use all possible hyperparameter combinations to extract the best accuracies. • Display and compare all the models designed, with their train and test accuracies. • Select the final best trained model, with detailed comments on why you selected it. • Pickle the selected model for future use.
  6. Conclusion and improvisation: [ Score: 5 points ] • Write your conclusion on the results. • Give detailed suggestions for improvements in the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the telecom operator, to enable better data analysis in future.

Data cleansing

Univariate Analysis

Bivariate Analysis

Multivariate Analysis

Let's convert the columns with an 'object' datatype into categorical variables
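A minimal sketch of the conversion, using a toy frame that stands in for the telco data (the real columns are analogous):

```python
import pandas as pd

# Toy frame standing in for the telco data (real columns are analogous)
df = pd.DataFrame({"gender": ["Male", "Female", "Male"],
                   "Contract": ["Month-to-month", "One year", "Two year"],
                   "tenure": [1, 34, 2]})

# Convert every 'object' column to the pandas 'category' dtype
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Numeric columns such as `tenure` are left untouched by the `select_dtypes` filter.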

Get value counts for every column
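One way to inspect the distribution of every column in a single pass, again on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({"Churn": ["No", "Yes", "No", "No"],
                   "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year"]})

# Print value counts for every column
for col in df.columns:
    print(df[col].value_counts(), end="\n\n")
```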

Hence, TotalCharges should be a numeric type; it is stored as 'object' (text) and needs to be converted.
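A sketch of the conversion; blank strings (typical for brand-new customers in this dataset) are coerced to NaN and then imputed:

```python
import pandas as pd

# TotalCharges often arrives as text, with blanks for brand-new customers
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"], "tenure": [1, 0, 34]})

# errors="coerce" turns non-numeric entries (like blanks) into NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Impute the NaNs -- the median keeps the imputation robust to outliers
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
print(df["TotalCharges"].dtype)
```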

SPLIT DATA
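A minimal split sketch: separate predictors from the target, then hold out a test set. Stratifying on the target keeps the churn ratio the same in both partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned, encoded churn data
df = pd.DataFrame({"tenure": range(8), "MonthlyCharges": range(8),
                   "Churn": [0, 1] * 4})

X = df.drop(columns="Churn")   # predictors
y = df["Churn"]                # target

# Stratify so train and test keep the same churn ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```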

Build Decision Tree Model

We will build our model using sklearn's DecisionTreeClassifier, with the default 'gini' criterion for splitting. Another option is 'entropy'.
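A minimal sketch on toy data standing in for the encoded churn features:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the encoded churn features
X = [[1, 0], [2, 0], [3, 1], [4, 1], [5, 1], [6, 0]]
y = [0, 0, 1, 1, 1, 0]

# criterion="gini" is the default; criterion="entropy" is the alternative
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X, y)
print(tree.get_depth())
```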

Scoring our Decision Tree
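A sketch of the scoring step, using a synthetic stand-in dataset so it runs on its own; a large gap between train and test accuracy signals overfitting:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A large gap between the two scores signals overfitting
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```

An unpruned tree will typically memorise the training set (train accuracy near 1.0) while scoring noticeably lower on the held-out data.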

Hence, this decision tree model is overfitted.

Visualizing the Decision Tree

Using plot_tree method from sklearn.tree
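A sketch of the visualisation; the Agg backend is an assumption so the example runs headlessly, and the feature/class names are illustrative stand-ins:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy tree standing in for the fitted churn model
X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

fig, ax = plt.subplots(figsize=(8, 5))
# plot_tree returns one annotation box per node in the rendered tree
nodes = plot_tree(tree, feature_names=["tenure"], class_names=["No", "Yes"],
                  filled=True, ax=ax)
fig.savefig("decision_tree.png")
print(len(nodes))
```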

Reducing overfitting (Regularization)
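A regularised variant of the same tree, again on synthetic stand-in data. Capping depth and requiring a minimum number of samples per leaf stops the tree from memorising the training set:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Cap the tree's growth: shallower trees memorise less of the training set
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                random_state=42).fit(X_train, y_train)
print("train:", pruned.score(X_train, y_train))
print("test:", pruned.score(X_test, y_test))
```

The train accuracy drops relative to the unpruned tree, but the train-test gap shrinks, which is the trade-off regularisation buys.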

Hence, we can observe that tenure, InternetService and Contract play an important role in the final prediction.

Confusion matrix
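A minimal sketch with hand-made labels; rows of the matrix are actual classes, columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual churn labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # model predictions

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_report(y_true, y_pred))
```

For churn, the off-diagonal cells matter most: false negatives are churners the retention programme would miss.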

Ensemble Learning - Bagging
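A bagging sketch on the same synthetic stand-in data: many deep trees fitted on bootstrap samples, with predictions averaged by vote:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: many deep trees on bootstrap samples, predictions averaged
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=42).fit(X_train, y_train)
print("train:", bag.score(X_train, y_train))
print("test:", bag.score(X_test, y_test))
```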

Ensemble Learning - AdaBoost
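An AdaBoost sketch on the same synthetic stand-in: a sequence of shallow trees (decision stumps by default), each reweighting the samples the previous one got wrong:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# AdaBoost: a sequence of shallow trees (stumps by default), each
# reweighting the samples the previous one got wrong
ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("train:", ada.score(X_train, y_train))
print("test:", ada.score(X_test, y_test))
```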

Ensemble Learning - GradientBoost
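A gradient-boosting sketch on the same synthetic stand-in: each new tree fits the residual errors of the ensemble built so far:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Gradient boosting: each new tree fits the residual errors of the ensemble
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=42).fit(X_train, y_train)
print("train:", gb.score(X_train, y_train))
print("test:", gb.score(X_test, y_test))
```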

Ensemble RandomForest Classifier
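A random-forest sketch on the same synthetic stand-in: bagging plus a random feature subset considered at every split, which decorrelates the trees:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random forest = bagging + a random feature subset at every split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=42).fit(X_train, y_train)
print("train:", rf.score(X_train, y_train))
print("test:", rf.score(X_test, y_test))
```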

Conclusion

Accuracy on training set for decision tree: 0.9970055611008127
Accuracy on training set for regularised decision tree: 0.7979176526265973
Accuracy on training set for decision tree with bagging: 0.8333333333333334
Accuracy on training set for decision tree with adaptive boosting: 0.8666666666666667
Accuracy on training set for decision tree with gradient boosting: 0.9
Accuracy on training set for random forest classifier: 0.9

Hence, we can conclude that the decision tree model with adaptive boosting gives us the best accuracy.
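Step 5 asks for the selected model to be pickled for future use. A minimal sketch, using a small AdaBoost fit as a stand-in for the selected model and a hypothetical file name:

```python
import pickle
from sklearn.ensemble import AdaBoostClassifier

# Small stand-in for the selected model (the real one is fitted on the churn data)
model = AdaBoostClassifier(n_estimators=10, random_state=42)
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# Persist the trained model for future scoring ("churn_model.pkl" is illustrative)
with open("churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later: load it back and predict without retraining
with open("churn_model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict([[3]]))
```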

We also observed that bagging classifiers generally benefit from complex individual models, while boosting classifiers generally benefit from simple (weak) models.